Overview

Dataset statistics

Number of variables: 34
Number of observations: 970
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 161
Duplicate rows (%): 16.6%
Total size in memory: 237.9 KiB
Average record size in memory: 251.1 B
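
The two derived figures in this table (duplicate percentage and average record size) follow directly from the raw counts. A minimal sketch of that arithmetic, using only the numbers reported above; the variable names are illustrative:

```python
# Values copied from the "Dataset statistics" table.
n_rows = 970
n_duplicates = 161
total_kib = 237.9  # total size in memory, in KiB

duplicate_pct = 100 * n_duplicates / n_rows
avg_record_bytes = total_kib * 1024 / n_rows

print(f"Duplicate rows: {duplicate_pct:.1f}%")
print(f"Average record size: {avg_record_bytes:.1f} B")
```

Both results round to the values the report shows (16.6% and 251.1 B).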

Variable types

Categorical: 3
Boolean: 25
Numeric: 6

Alerts

Dataset has 161 (16.6%) duplicate rows (Duplicates)
wordcount is highly correlated with pagecount and 2 other fields (High correlation)
pagecount is highly correlated with wordcount and 3 other fields (High correlation)
openaccesscolor is highly correlated with filestatus and 8 other fields (High correlation)
reliability is highly correlated with manual_classification and 9 other fields (High correlation)
filestatus is highly correlated with isopenaccesstitle and 6 other fields (High correlation)
isopenaccesstitle is highly correlated with filestatus and 7 other fields (High correlation)
doi_in_oa is highly correlated with filestatus and 6 other fields (High correlation)
manual_classification is highly correlated with filestatus and 15 other fields (High correlation)
DOI_in_OA is highly correlated with filestatus and 6 other fields (High correlation)
DOI_no_PPT is highly correlated with filestatus and 7 other fields (High correlation)
PPT_in_name is highly correlated with reliability and 1 other field (High correlation)
ppt_creator is highly correlated with reliability and 2 other fields (High correlation)
10_pics_page is highly correlated with reliability and 2 other fields (High correlation)
Contains_DOI is highly correlated with filestatus and 7 other fields (High correlation)
Contains_ISBN is highly correlated with filestatus (High correlation)
words_page>350 is highly correlated with reliability and 1 other field (High correlation)
keyword_creator is highly correlated with reliability (High correlation)
Creative commons is highly correlated with isopenaccesstitle and 1 other field (High correlation)
Words_more_300pp is highly correlated with manual_classification and 2 other fields (High correlation)
10>_Pagecount_<50 is highly correlated with openaccesscolor and 5 other fields (High correlation)
Kleiner_10_paginas is highly correlated with reliability and 3 other fields (High correlation)
Pagecount_bigger_50 is highly correlated with reliability and 3 other fields (High correlation)
images_same_pagecount is highly correlated with pagecount (High correlation)
Minder dan 50 woorden per pagina is highly correlated with reliability and 4 other fields (High correlation)
isfilepublished is highly correlated with filesize (High correlation)
filesize is highly correlated with wordcount and 2 other fields (High correlation)
picturecount is highly correlated with wordcount and 2 other fields (High correlation)
wordcount_o is highly correlated with Minder dan 50 woorden per pagina (High correlation)
wordcount has 29 (3.0%) zeros (Zeros)
filesize has 236 (24.3%) zeros (Zeros)
picturecount has 144 (14.8%) zeros (Zeros)
reliability has 38 (3.9%) zeros (Zeros)

Reproduction

Analysis started: 2022-11-09 10:37:59.741295
Analysis finished: 2022-11-09 10:38:11.813982
Duration: 12.07 seconds
Software version: pandas-profiling v3.4.0
Download configuration: config.json

Variables

filestatus
Categorical

HIGH CORRELATION

Distinct: 3
Distinct (%): 0.3%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value  Count
50     822
80     118
90     30

Length

Max length: 2
Median length: 2
Mean length: 2
Min length: 2

Characters and Unicode

Total characters: 1940
Distinct characters: 4
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
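
As an illustration of how such character properties can be read per code point, the sketch below rebuilds the filestatus column from the counts reported in this section and tallies characters by Unicode general category using Python's standard `unicodedata` module. The helper name `char_profile` is hypothetical and this is not the report generator's own code.

```python
import unicodedata
from collections import Counter


def char_profile(values):
    """Count characters and their Unicode general categories across strings."""
    chars = Counter(ch for value in values for ch in value)
    categories = Counter()
    for ch, count in chars.items():
        # unicodedata.category returns a two-letter code, e.g. "Nd" for Decimal Number
        categories[unicodedata.category(ch)] += count
    return chars, categories


# Rebuild the column from the reported value counts (50: 822, 80: 118, 90: 30).
values = ["50"] * 822 + ["80"] * 118 + ["90"] * 30
chars, categories = char_profile(values)

print(sum(chars.values()), "characters,", len(chars), "distinct")
print(dict(categories))
```

Every character falls in the single category `Nd` (Decimal Number), which matches the one distinct category reported for this variable.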

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 50
2nd row: 50
3rd row: 50
4th row: 50
5th row: 50

Common Values

Value  Count  Frequency (%)
50     822    84.7%
80     118    12.2%
90     30     3.1%

Length

Histogram of lengths of the category

Category Frequency Plot

Value  Count  Frequency (%)
50     822    84.7%
80     118    12.2%
90     30     3.1%

Most occurring characters

Value  Count  Frequency (%)
0      970    50.0%
5      822    42.4%
8      118    6.1%
9      30     1.5%

Most occurring categories

Value           Count  Frequency (%)
Decimal Number  1940   100.0%

Most frequent character per category

Decimal Number
Value  Count  Frequency (%)
0      970    50.0%
5      822    42.4%
8      118    6.1%
9      30     1.5%

Most occurring scripts

Value   Count  Frequency (%)
Common  1940   100.0%

Most frequent character per script

Common
Value  Count  Frequency (%)
0      970    50.0%
5      822    42.4%
8      118    6.1%
9      30     1.5%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  1940   100.0%

Most frequent character per block

ASCII
Value  Count  Frequency (%)
0      970    50.0%
5      822    42.4%
8      118    6.1%
9      30     1.5%

isfilepublished
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  649    66.9%
True   321    33.1%

wordcount
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 346
Distinct (%): 35.7%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 40797.37732
Minimum: 0
Maximum: 2095781
Zeros: 29
Zeros (%): 3.0%
Negative: 0
Negative (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 112
Q1: 2291
Median: 9577
Q3: 35521
95th percentile: 203232.65
Maximum: 2095781
Range: 2095781
Interquartile range (IQR): 33230

Descriptive statistics

Standard deviation: 134514.6039
Coefficient of variation (CV): 3.29713851
Kurtosis: 170.9958669
Mean: 40797.37732
Median Absolute Deviation (MAD): 9041.5
Skewness: 11.87540349
Sum: 39573456
Variance: 1.809417866 × 10^10
Monotonicity: Not monotonic
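
Several of the statistics above can be derived from one another, which gives a quick consistency check. The sketch below uses only the values reported for wordcount; the variable names are illustrative.

```python
# Values copied from the wordcount statistics above.
mean = 40797.37732
std = 134514.6039
q1, q3 = 2291, 35521
minimum, maximum = 0, 2095781

cv = std / mean                  # coefficient of variation
iqr = q3 - q1                    # interquartile range
value_range = maximum - minimum  # range
variance = std ** 2              # variance is the squared standard deviation

print(f"CV       = {cv:.4f}")
print(f"IQR      = {iqr}")
print(f"Range    = {value_range}")
print(f"Variance = {variance:.4e}")
```

A CV above 3 combined with a skewness near 12 points to a heavily right-skewed column, so the median (9577) describes a typical document better than the mean does.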
Histogram with fixed size bins (bins=50)
Common values

Value               Count  Frequency (%)
81694               52     5.4%
9577                44     4.5%
112                 32     3.3%
0                   29     3.0%
329                 26     2.7%
216156              22     2.3%
35521               14     1.4%
841                 12     1.2%
202670              12     1.2%
60142               10     1.0%
Other values (336)  717    73.9%

Minimum 10 values

Value  Count  Frequency (%)
0      29     3.0%
91     1      0.1%
112    32     3.3%
123    2      0.2%
124    1      0.1%
132    1      0.1%
137    1      0.1%
143    1      0.1%
205    1      0.1%
218    1      0.1%

Maximum 10 values

Value    Count  Frequency (%)
2095781  3      0.3%
851895   3      0.3%
282557   3      0.3%
251500   4      0.4%
250608   1      0.1%
243495   1      0.1%
216214   2      0.2%
216156   22     2.3%
216141   7      0.7%
205464   1      0.1%

pagecount
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 163
Distinct (%): 16.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 134.5525773
Minimum: 1
Maximum: 3395
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 1
5th percentile: 1
Q1: 23
Median: 54
Q3: 126
95th percentile: 524
Maximum: 3395
Range: 3394
Interquartile range (IQR): 103

Descriptive statistics

Standard deviation: 291.0832723
Coefficient of variation (CV): 2.163342228
Kurtosis: 59.36835078
Mean: 134.5525773
Median Absolute Deviation (MAD): 40
Skewness: 6.671808435
Sum: 130516
Variance: 84729.47143
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value               Count  Frequency (%)
1                   60     6.2%
256                 56     5.8%
34                  44     4.5%
52                  33     3.4%
31                  27     2.8%
10                  25     2.6%
2                   25     2.6%
524                 24     2.5%
32                  22     2.3%
50                  22     2.3%
Other values (153)  632    65.2%

Minimum 10 values

Value  Count  Frequency (%)
1      60     6.2%
2      25     2.6%
3      4      0.4%
4      19     2.0%
5      12     1.2%
6      7      0.7%
7      2      0.2%
8      12     1.2%
10     25     2.6%
11     1      0.1%

Maximum 10 values

Value  Count  Frequency (%)
3395   3      0.3%
2280   3      0.3%
1523   3      0.3%
1402   4      0.4%
1196   1      0.1%
1046   3      0.3%
934    1      0.1%
801    12     1.2%
694    2      0.2%
526    7      0.7%

isopenaccesstitle
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  935    96.4%
True   35     3.6%

filesize
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 210
Distinct (%): 21.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.333246894
Minimum: 0
Maximum: 212.294282
Zeros: 236
Zeros (%): 24.3%
Negative: 0
Negative (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 0.069962
Median: 0.768802
Q3: 2.841351
95th percentile: 15.334875
Maximum: 212.294282
Range: 212.294282
Interquartile range (IQR): 2.771389

Descriptive statistics

Standard deviation: 12.59978078
Coefficient of variation (CV): 3.780032257
Kurtosis: 233.3072435
Mean: 3.333246894
Median Absolute Deviation (MAD): 0.768802
Skewness: 14.35288489
Sum: 3233.249487
Variance: 158.7544757
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value               Count  Frequency (%)
0                   236    24.3%
4.242592            52     5.4%
0.768802            44     4.5%
0.078495            32     3.3%
2.831684            26     2.7%
4.389999            14     1.4%
15.440968           12     1.2%
19.577692           12     1.2%
0.148716            12     1.2%
9.246692            10     1.0%
Other values (200)  520    53.6%

Minimum 10 values

Value     Count  Frequency (%)
0         236    24.3%
0.040013  1      0.1%
0.040682  1      0.1%
0.044783  1      0.1%
0.045134  1      0.1%
0.055721  1      0.1%
0.0577    1      0.1%
0.069962  4      0.4%
0.078495  32     3.3%
0.082584  1      0.1%

Maximum 10 values

Value       Count  Frequency (%)
212.294282  3      0.3%
62.489579   1      0.1%
36.52486    1      0.1%
24.783344   1      0.1%
19.959697   2      0.2%
19.577692   12     1.2%
19.076639   1      0.1%
18.996502   2      0.2%
16.75736    6      0.6%
15.441408   6      0.6%

incollection
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  964    99.4%
True   6      0.6%

openaccesscolor
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 6
Distinct (%): 0.6%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.678350515
Minimum: 0
Maximum: 5
Zeros: 4
Zeros (%): 0.4%
Negative: 0
Negative (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 1
Q1: 5
Median: 5
Q3: 5
95th percentile: 5
Maximum: 5
Range: 5
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 1.046713546
Coefficient of variation (CV): 0.2237355971
Kurtosis: 8.358643549
Mean: 4.678350515
Median Absolute Deviation (MAD): 0
Skewness: -3.153700561
Sum: 4538
Variance: 1.095609247
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=6)
Common values

Value  Count  Frequency (%)
5      876    90.3%
1      59     6.1%
3      13     1.3%
4      12     1.2%
2      6      0.6%
0      4      0.4%

Minimum values

Value  Count  Frequency (%)
0      4      0.4%
1      59     6.1%
2      6      0.6%
3      13     1.3%
4      12     1.2%
5      876    90.3%

Maximum values

Value  Count  Frequency (%)
5      876    90.3%
4      12     1.2%
3      13     1.3%
2      6      0.6%
1      59     6.1%
0      4      0.4%

picturecount
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 170
Distinct (%): 17.5%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 195.2247423
Minimum: 0
Maximum: 17844
Zeros: 144
Zeros (%): 14.8%
Negative: 0
Negative (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 0
Q1: 5
Median: 30
Q3: 100
95th percentile: 762
Maximum: 17844
Range: 17844
Interquartile range (IQR): 95

Descriptive statistics

Standard deviation: 1086.304152
Coefficient of variation (CV): 5.564377443
Kurtosis: 217.6629539
Mean: 195.2247423
Median Absolute Deviation (MAD): 30
Skewness: 13.99196065
Sum: 189368
Variance: 1180056.711
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)
Common values

Value               Count  Frequency (%)
0                   144    14.8%
30                  55     5.7%
14                  52     5.4%
2                   33     3.4%
42                  28     2.9%
1                   23     2.4%
3                   21     2.2%
8                   19     2.0%
1339                18     1.9%
130                 17     1.8%
Other values (160)  560    57.7%

Minimum 10 values

Value  Count  Frequency (%)
0      144    14.8%
1      23     2.4%
2      33     3.4%
3      21     2.2%
4      9      0.9%
5      14     1.4%
6      11     1.1%
7      12     1.2%
8      19     2.0%
9      3      0.3%

Maximum 10 values

Value  Count  Frequency (%)
17844  3      0.3%
6817   2      0.2%
5077   1      0.1%
3913   2      0.2%
2024   3      0.3%
1455   2      0.2%
1344   2      0.2%
1339   18     1.9%
870    1      0.1%
801    12     1.2%

reliability
Real number (ℝ≥0)

HIGH CORRELATION
ZEROS

Distinct: 41
Distinct (%): 4.2%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 72.13917526
Minimum: 0
Maximum: 100
Zeros: 38
Zeros (%): 3.9%
Negative: 0
Negative (%): 0.0%
Memory size: 7.7 KiB

Quantile statistics

Minimum: 0
5th percentile: 30
Q1: 66
Median: 74
Q3: 87
95th percentile: 98
Maximum: 100
Range: 100
Interquartile range (IQR): 21

Descriptive statistics

Standard deviation: 23.01787817
Coefficient of variation (CV): 0.3190759818
Kurtosis: 2.082925924
Mean: 72.13917526
Median Absolute Deviation (MAD): 10
Skewness: -1.411627945
Sum: 69975
Variance: 529.8227155
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=41)
Common values

Value              Count  Frequency (%)
66                 264    27.2%
95                 98     10.1%
30                 72     7.4%
72                 71     7.3%
75                 51     5.3%
98                 46     4.7%
87                 41     4.2%
84                 38     3.9%
0                  38     3.9%
96                 31     3.2%
Other values (31)  220    22.7%

Minimum 10 values

Value  Count  Frequency (%)
0      38     3.9%
1      1      0.1%
26     1      0.1%
30     72     7.4%
41     2      0.2%
50     2      0.2%
55     4      0.4%
58     12     1.2%
62     3      0.3%
63     3      0.3%

Maximum 10 values

Value  Count  Frequency (%)
100    30     3.1%
99     3      0.3%
98     46     4.7%
97     9      0.9%
96     31     3.2%
95     98     10.1%
93     1      0.1%
92     1      0.1%
91     6      0.6%
90     6      0.6%

doi_in_oa
Categorical

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value    Count
"False"  940
"True"   30

Length

Max length: 7
Median length: 7
Mean length: 6.969072165
Min length: 6

Characters and Unicode

Total characters: 6760
Distinct characters: 9
Distinct categories: 3
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: "False"
2nd row: "False"
3rd row: "False"
4th row: "False"
5th row: "False"

Common Values

Value    Count  Frequency (%)
"False"  940    96.9%
"True"   30     3.1%

Length

Histogram of lengths of the category

Category Frequency Plot

Value  Count  Frequency (%)
false  940    96.9%
true   30     3.1%

Most occurring characters

Value  Count  Frequency (%)
"      1940   28.7%
e      970    14.3%
F      940    13.9%
a      940    13.9%
l      940    13.9%
s      940    13.9%
T      30     0.4%
r      30     0.4%
u      30     0.4%

Most occurring categories

Value              Count  Frequency (%)
Lowercase Letter   3850   57.0%
Other Punctuation  1940   28.7%
Uppercase Letter   970    14.3%

Most frequent character per category

Lowercase Letter
Value  Count  Frequency (%)
e      970    25.2%
a      940    24.4%
l      940    24.4%
s      940    24.4%
r      30     0.8%
u      30     0.8%

Uppercase Letter
Value  Count  Frequency (%)
F      940    96.9%
T      30     3.1%

Other Punctuation
Value  Count  Frequency (%)
"      1940   100.0%

Most occurring scripts

Value   Count  Frequency (%)
Latin   4820   71.3%
Common  1940   28.7%

Most frequent character per script

Latin
Value  Count  Frequency (%)
e      970    20.1%
F      940    19.5%
a      940    19.5%
l      940    19.5%
s      940    19.5%
T      30     0.6%
r      30     0.6%
u      30     0.6%

Common
Value  Count  Frequency (%)
"      1940   100.0%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  6760   100.0%

Most frequent character per block

ASCII
Value  Count  Frequency (%)
"      1940   28.7%
e      970    14.3%
F      940    13.9%
a      940    13.9%
l      940    13.9%
s      940    13.9%
T      30     0.4%
r      30     0.4%
u      30     0.4%

manual_classification
Categorical

HIGH CORRELATION

Distinct: 11
Distinct (%): 1.1%
Missing: 0
Missing (%): 0.0%
Memory size: 7.7 KiB

Value                         Count
eigen materiaal - powerpoint  267
eigen materiaal - overig      233
lange overname                138
in onderzoek                  102
open access                   65
Other values (6)              165

Length

Max length: 32
Median length: 28
Mean length: 21.45257732
Min length: 8

Characters and Unicode

Total characters: 20809
Distinct characters: 24
Distinct categories: 3
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

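Unique: 0
Unique (%): 0.0%

Sample

1st row: eigen materiaal - powerpoint
2nd row: eigen materiaal - powerpoint
3rd row: eigen materiaal - powerpoint
4th row: eigen materiaal - powerpoint
5th row: eigen materiaal - powerpoint

Common Values

The values are Dutch labels; an English gloss is given in parentheses.

Value                                                                 Count  Frequency (%)
eigen materiaal - powerpoint (own material - PowerPoint)              267    27.5%
eigen materiaal - overig (own material - other)                       233    24.0%
lange overname (long excerpt)                                         138    14.2%
in onderzoek (under investigation)                                    102    10.5%
open access                                                           65     6.7%
middellange overname (medium-length excerpt)                          64     6.6%
verwijderverzoek verstuurd (removal request sent)                     42     4.3%
eigen materiaal - titelindicatie (own material - title indication)    38     3.9%
korte overname (short excerpt)                                        15     1.5%
onbekend (unknown)                                                    4      0.4%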

Length

Histogram of lengths of the category

Value             Count  Frequency (%)
eigen             538    17.9%
(blank)           538    17.9%
materiaal         538    17.9%
powerpoint        267    8.9%
overig            233    7.7%
overname          217    7.2%
lange             138    4.6%
in                102    3.4%
onderzoek         102    3.4%
access            65     2.2%
Other values (9)  274    9.1%

Most occurring characters

Value              Count  Frequency (%)
e                  3463   16.6%
a                  2140   10.3%
(space)            2042   9.8%
i                  1942   9.3%
r                  1584   7.6%
n                  1541   7.4%
o                  1314   6.3%
t                  978    4.7%
g                  973    4.7%
l                  844    4.1%
Other values (14)  3988   19.2%

Most occurring categories

Value             Count  Frequency (%)
Lowercase Letter  18229  87.6%
Space Separator   2042   9.8%
Dash Punctuation  538    2.6%

Most frequent character per category

Lowercase Letter
Value              Count  Frequency (%)
e                  3463   19.0%
a                  2140   11.7%
i                  1942   10.7%
r                  1584   8.7%
n                  1541   8.5%
o                  1314   7.2%
t                  978    5.4%
g                  973    5.3%
l                  844    4.6%
m                  819    4.5%
Other values (12)  2631   14.4%

Space Separator
Value    Count  Frequency (%)
(space)  2042   100.0%

Dash Punctuation
Value  Count  Frequency (%)
-      538    100.0%

Most occurring scripts

Value   Count  Frequency (%)
Latin   18229  87.6%
Common  2580   12.4%

Most frequent character per script

Latin
Value              Count  Frequency (%)
e                  3463   19.0%
a                  2140   11.7%
i                  1942   10.7%
r                  1584   8.7%
n                  1541   8.5%
o                  1314   7.2%
t                  978    5.4%
g                  973    5.3%
l                  844    4.6%
m                  819    4.5%
Other values (12)  2631   14.4%

Common
Value    Count  Frequency (%)
(space)  2042   79.1%
-        538    20.9%

Most occurring blocks

Value  Count  Frequency (%)
ASCII  20809  100.0%

Most frequent character per block

ASCII
Value              Count  Frequency (%)
e                  3463   16.6%
a                  2140   10.3%
(space)            2042   9.8%
i                  1942   9.3%
r                  1584   7.6%
n                  1541   7.4%
o                  1314   6.3%
t                  978    4.7%
g                  973    4.7%
l                  844    4.1%
Other values (14)  3988   19.2%

DOI_in_OA
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  940    96.9%
True   30     3.1%

DOI_no_PPT
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  876    90.3%
True   94     9.7%

PPT_in_name
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  886    91.3%
True   84     8.7%

ppt_creator
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  855    88.1%
True   115    11.9%

wordcount_o
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  941    97.0%
True   29     3.0%

10_pics_page
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
True   681    70.2%
False  289    29.8%

Contains_DOI
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  876    90.3%
True   94     9.7%

Contains_ISBN
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  916    94.4%
True   54     5.6%

creator_abbyy
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  968    99.8%
True   2      0.2%

words_page>350
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  761    78.5%
True   209    21.5%

keyword_creator
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  909    93.7%
True   61     6.3%

Creative commons
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  958    98.8%
True   12     1.2%

Words_more_300pp
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  612    63.1%
True   358    36.9%

10>_Pagecount_<50
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  726    74.8%
True   244    25.2%

Contains_copyright
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  889    91.6%
True   81     8.4%

Kleiner_10_paginas
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  804    82.9%
True   166    17.1%

filename_indicator
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  963    99.3%
True   7      0.7%

Pagecount_bigger_50
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
True   560    57.7%
False  410    42.3%

book_and_words<10000
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  965    99.5%
True   5      0.5%

Contains_published_in
Boolean

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  957    98.7%
True   13     1.3%

images_same_pagecount
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  942    97.1%
True   28     2.9%

Minder dan 50 woorden per pagina
Boolean

HIGH CORRELATION

Distinct: 2
Distinct (%): 0.2%
Missing: 0
Missing (%): 0.0%
Memory size: 1.1 KiB

Value  Count  Frequency (%)
False  811    83.6%
True   159    16.4%

Interactions


Correlations


Auto

The auto setting is an easily interpretable pairwise column metric that picks a method from the types of the two columns (vartype-vartype: method): categorical-categorical: Cramér's V; numerical-categorical: Cramér's V (using a discretized numerical column); numerical-numerical: Spearman's ρ. This configuration uses the most suitable method for each pair of columns.
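
The type-pair mapping described above can be sketched as a small lookup. The function name `auto_method` and the type labels `"numerical"` and `"categorical"` are hypothetical, chosen for illustration; this is not the report generator's internal code.

```python
def auto_method(type_a: str, type_b: str) -> str:
    """Pick a correlation method for a pair of column types (illustrative)."""
    allowed = {"numerical", "categorical"}
    if type_a not in allowed or type_b not in allowed:
        raise ValueError(f"unsupported column types: {type_a!r}, {type_b!r}")
    if type_a == type_b == "numerical":
        return "Spearman's rho"
    # Categorical-categorical pairs, and mixed pairs once the numerical
    # column has been discretized, both use Cramer's V.
    return "Cramer's V"


print(auto_method("numerical", "numerical"))
print(auto_method("numerical", "categorical"))
```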

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation, and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
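
A minimal pure-Python sketch of that calculation, assuming tied values receive the average of their rank positions (a common convention, not stated in the text above). The function names are illustrative.

```python
from statistics import mean


def average_ranks(values):
    """Assign 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance divided by the product of the standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# A nonlinear but monotonic relationship still gives rho close to 1.
x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]
print(spearman(x, y))
```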

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation, and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that, for a linear relationship, the steepness of the line does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
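
A minimal pure-Python sketch of that formula, followed by a check of the location and scale invariance mentioned above. The function name `pearson_r` and the sample data are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r: covariance of x and y over the product of their std devs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    # The 1/n (or 1/(n-1)) factors cancel between numerator and denominator.
    return cov / sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

r_original = pearson_r(x, y)
# Shift and rescale x: r should be unchanged.
r_rescaled = pearson_r([3 * a + 7 for a in x], y)
print(r_original, r_rescaled)
```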

Kendall's τ

Like Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation, and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
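
The pair-counting definition above can be sketched directly. This is the simple variant without a tie correction (often called tau-a); statistical libraries commonly report a tie-adjusted variant instead, so results can differ when the data contain ties. The function name is illustrative.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total pairs, with no tie correction."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
        # direction == 0 means a tie in x or y: counted in neither group
    n = len(x)
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


print(kendall_tau_a([1, 2, 3, 4], [1, 2, 3, 4]))
print(kendall_tau_a([1, 2, 3], [1, 3, 2]))
```

In the second call, two of the three pairs are concordant and one is discordant, giving (2 - 1) / 3.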

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been shown to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
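
A minimal pure-Python sketch of a bias-corrected Cramér's V, following the correction as it is commonly stated for Bergsma's estimator (shrink φ² by (k-1)(r-1)/(n-1) and adjust the table dimensions). It is an illustration under that assumption, not the report generator's own code, and the function name is hypothetical.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(x, y):
    """Bias-corrected Cramer's V for two equally long sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    r, k = len(row_totals), len(col_totals)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for a, row_n in row_totals.items():
        for b, col_n in col_totals.items():
            expected = row_n * col_n / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0  # a constant column carries no association
    return sqrt(phi2_corr / denom)


labels = ["a"] * 50 + ["b"] * 50
print(cramers_v_corrected(labels, labels))  # identical columns
print(cramers_v_corrected(["a", "a", "b", "b"], ["c", "d", "c", "d"]))  # independent
```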

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available from the Phik project.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion by eye.

Sample

First rows

(index), filestatus, isfilepublished, wordcount, pagecount, isopenaccesstitle, filesize, incollection, openaccesscolor, picturecount, reliability, doi_in_oa, manual_classification, DOI_in_OA, DOI_no_PPT, PPT_in_name, ppt_creator, wordcount_o, 10_pics_page, Contains_DOI, Contains_ISBN, creator_abbyy, words_page>350, keyword_creator, Creative commons, Words_more_300pp, 10>_Pagecount_<50, Contains_copyright, Kleiner_10_paginas, filename_indicator, Pagecount_bigger_50, book_and_words<10000, Contains_published_in, images_same_pagecount, Minder dan 50 woorden per pagina
0, 50, False, 459, 2, False, 0.147544, False, 5, 0, 30, "False", eigen materiaal - powerpoint, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False
1, 50, False, 23085, 83, False, 0.000000, False, 5, 251, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False
2, 50, False, 37047, 105, False, 0.000000, False, 5, 321, 72, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, True, False, False, True, False, False, False, False, True, False, False, False, False
3, 50, False, 24218, 81, False, 0.000000, False, 5, 253, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False
4, 50, False, 24218, 81, False, 0.000000, False, 5, 253, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False
5, 50, False, 24218, 81, False, 0.000000, False, 5, 253, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False
6, 50, False, 24218, 81, False, 0.000000, False, 5, 253, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False
7, 50, False, 14770, 56, False, 0.000000, False, 5, 176, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False
8, 50, False, 16057, 50, False, 0.000000, False, 5, 152, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, True, False, False, False, False, True, False, False, False, False
9, 50, False, 23085, 83, False, 0.000000, False, 5, 251, 66, "False", eigen materiaal - powerpoint, False, False, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False, False

Last rows

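| | filestatus | isfilepublished | wordcount | pagecount | isopenaccesstitle | filesize | incollection | openaccesscolor | picturecount | reliability | doi_in_oa | manual_classification | DOI_in_OA | DOI_no_PPT | PPT_in_name | ppt_creator | wordcount_o | 10_pics_page | Contains_DOI | Contains_ISBN | creator_abbyy | words_page>350 | keyword_creator | Creative commons | Words_more_300pp | 10>_Pagecount_<50 | Contains_copyright | Kleiner_10_paginas | filename_indicator | Pagecount_bigger_50 | book_and_words<10000 | Contains_published_in | images_same_pagecount | Minder dan 50 woorden per pagina |
| 960 | 50 | False | 238 | 1 | False | 0.200540 | False | 5 | 10 | 30 | "False" | eigen materiaal - overig | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False |
| 961 | 50 | False | 0 | 6 | False | 2.033238 | False | 5 | 6 | 80 | "False" | eigen materiaal - overig | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | True | True |
| 962 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 963 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 964 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 965 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 966 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 967 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 968 | 50 | False | 35521 | 159 | False | 4.389999 | False | 5 | 0 | 83 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False |
| 969 | 50 | True | 53348 | 169 | False | 0.000000 | False | 5 | 7 | 87 | "False" | lange overname | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | True | False | False | False | False |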

Duplicate rows

Most frequently occurring

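| | filestatus | isfilepublished | wordcount | pagecount | isopenaccesstitle | filesize | incollection | openaccesscolor | picturecount | reliability | doi_in_oa | manual_classification | DOI_in_OA | DOI_no_PPT | PPT_in_name | ppt_creator | wordcount_o | 10_pics_page | Contains_DOI | Contains_ISBN | creator_abbyy | words_page>350 | keyword_creator | Creative commons | Words_more_300pp | 10>_Pagecount_<50 | Contains_copyright | Kleiner_10_paginas | filename_indicator | Pagecount_bigger_50 | book_and_words<10000 | Contains_published_in | images_same_pagecount | Minder dan 50 woorden per pagina | # duplicates |
| 77 | 50 | False | 81694 | 256 | False | 4.242592 | False | 5 | 30 | 66 | "False" | lange overname | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | True | False | False | False | False | 48 |
| 138 | 80 | False | 9577 | 34 | False | 0.768802 | False | 1 | 14 | 75 | "False" | middellange overname | False | True | False | False | False | True | True | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | 44 |
| 90 | 50 | True | 112 | 1 | False | 0.078495 | False | 5 | 0 | 30 | "False" | eigen materiaal - overig | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | 32 |
| 5 | 50 | False | 329 | 31 | False | 2.831684 | False | 5 | 42 | 98 | "False" | eigen materiaal - powerpoint | False | False | False | True | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | 26 |
| 84 | 50 | False | 216156 | 524 | False | 15.440968 | False | 5 | 1339 | 72 | "False" | eigen materiaal - overig | False | False | False | False | False | True | False | False | False | True | False | False | True | False | False | False | False | True | False | False | False | False | 12 |
| 12 | 50 | False | 841 | 4 | False | 0.148716 | False | 5 | 0 | 30 | "False" | eigen materiaal - overig | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | 10 |
| 73 | 50 | False | 60142 | 225 | False | 9.246692 | False | 5 | 226 | 84 | "False" | open access | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | True | False | False | False | False | 10 |
| 14 | 50 | False | 880 | 32 | False | 1.426790 | False | 5 | 18 | 95 | "False" | eigen materiaal - powerpoint | False | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | 9 |
| 58 | 50 | False | 15689 | 52 | False | 1.171354 | False | 5 | 20 | 66 | "False" | eigen materiaal - overig | False | False | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | True | False | False | False | False | 9 |
| 2 | 50 | False | 0 | 10 | False | 0.447311 | False | 5 | 100 | 79 | "False" | eigen materiaal - overig | False | False | False | False | True | True | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | True | 8 |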
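A ranking like the one above can be approximated with a short sketch. It assumes that "# duplicates" is the number of times each distinct row occurs in the dataset, which the report itself does not state. The function name and the four-column sample rows are hypothetical and abbreviated for illustration.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Row = Tuple[Hashable, ...]


def most_frequent_duplicates(rows: Sequence[Row], top: int = 10) -> List[Tuple[Row, int]]:
    """Return the distinct rows that occur more than once, most frequent first."""
    counts = Counter(rows)
    duplicated = [(row, n) for row, n in counts.items() if n > 1]
    # sorted() is stable, so rows with equal counts keep first-seen order.
    return sorted(duplicated, key=lambda item: item[1], reverse=True)[:top]


# Hypothetical rows: (filestatus, isfilepublished, wordcount, pagecount).
sample = [
    (50, False, 238, 1),
    (50, False, 35521, 159),
    (50, False, 24218, 81),
    (50, False, 35521, 159),
    (50, False, 24218, 81),
    (50, False, 35521, 159),
]

for row, n in most_frequent_duplicates(sample):
    print(row, n)
```

Rows must be hashable for `Counter` to group them, which is why the sample uses tuples rather than lists or dicts.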